Abstract. The proliferation of media sharing and social networking
websites has brought with it vast collections of site-specific user
generated content. The result is a Social Networking Divide in which the
concepts and structure common across different sites are hidden.
The knowledge and structures from one social site are not adequately
exploited to provide new information and resources to the same or
different users in comparable social sites. For music bloggers, this latent
structure forces them to select sub-optimal blogrolls. However, by
integrating the social activities of music bloggers and listeners, we are
able to overcome this limitation: improving the quality of the blogroll
neighborhoods, in terms of similarity, by 85 percent when using tracks
and by 120 percent when integrating tags from another site.
1 Introduction

Ever-growing collections of user generated content are spread over
heterogeneous social networking and media sharing platforms, each supporting
specific media types. The content typically has a latent structure and latent,
interrelated topics; this has resulted in a Social Networking Divide.

Recent advances toward a more Open Social Networking (OSN) paradigm are
focused on (de facto) standards, and only address part of the problem.
Specifically, current OSN efforts attempt to handle issues related to the
portability of data, common APIs (e.g., Google OpenSocial¹), and social graphs,
e.g., FOAF² and XHTML Friends Network³.

We posit that open social networking is more than an agreed upon "language"
for describing relationships and sharing data across systems. In addition, it is
the exploitation of social activities in one site, to support the discovery of new
interrelationships within a community. This is crucial, given that it is becoming
increasingly difficult for seekers to cope with the cognitive challenges of efficiently
finding and effectively analyzing relevant information, when inundated with its
volume, variety and evolution.

1 http://code.google.com/opensocial

2 http://www.foaf-project.org/

3 http://gmpg.org/xfn/

1.1 Scenario 

To motivate the aforementioned ideas, consider the following scenario in which
there are two social network sites. In one site, a Blogger.com⁴ music community,
the main activities are writing text about artists, tracks, albums or music videos.
Bloggers create explicit links to other participants to express their preferred blogs
(via a blogroll). In the other site, Last.fm⁵, the users listen to music tracks, tag
these tracks and build friendship relationships.

Symbiotically, the social activities in one site can have an impact on the 
other site. Blogger.com bloggers do not tag the entities about which they write. 
However, the tagging activity can better help bloggers see the structure in their 
community and find new information. Conversely, Last.fm users do not provide 
prose for the tracks they listen to, but such prose can be a valuable source of 
metadata for audio tracks. 

In Figure 1, a navigation tool is depicted, in which the Blogger.com site has
been enriched with information from Last.fm. The graph represents the
similarity between (potentially unknown) bloggers on the set of tracks they have
written about in their blogs. By selecting a node in the graph, the list of tracks
that blogger has mentioned in their blog site is presented, along with the overall
popularity of these tracks. Also depicted in the figure are the tags from Last.fm,
which can be used to filter the nodes and edges in the graph. The tool supports
navigation and visualization of latent concepts and relations within and
across the sites. One of the challenges in realizing such a scenario is that the
concepts and structure within and across domains are latent. Specifically within
blogs, the topics to which the blog site is devoted are very often not made
explicit. Furthermore, the readership relationship is typically unobserved; and the
blogroll relationship, though observed, is often unexplained. Finally, standards
may support, but do not address, how the social practices in one domain may
be exploited to support bloggers in another similar social site, particularly when
resources of different types are being mapped.

The contributions of this work are: 1) extension of the commonly held view
of open social networking: to infer new relationships and resources, and provide
support with navigation and visualization across comparable social networks; 2)
integration of music bloggers with music listeners, bridging the gap between
different types of social networking systems; 3) examination of the blogroll
relationship in the music domain based on an open social networking approach.

In Section 2 we discuss related research in cross domain discovery. In Section 3
we present the conceptual approach to open social networking in the music
domain. In Section 4 experimental results are presented and in Section 5 we
conclude and discuss future work.

4 http://www.blogger.com

5 http://www.last.fm

Fig. 1: Cross System Visualization Blogger.com and Last.fm



2 Related Work 

In machine learning, cross domain discovery [1] or domain adaptation [2] is a
body of work in which multiple information sources from comparable, but
different domains are combined. Work done in this area has focused on classification
tasks, in which labeled data exists in abundance in one domain, but a statistical
model that performs well on a related domain is desired. Since hand-labeling in
the new domain is costly, one often wishes to leverage the original out-of-domain
data when building a model of the new, in-domain data [3].

Also in the area of machine learning, the method of View Completion [4] has
used collaborative tags to heuristically complete missing or inadequate feature
sets (or views). The basic premise underlying view completion is that, for many
tasks, combining multiple information sources yields significantly better results
than using a single one alone. Views are used in this case since blogs
are not typically available on collaborative tagging websites, and as such the
tags provided by bloggers suffer from the vocabulary problem and cannot be
adequately used as a shared index.

Another related area is Cross-System Personalization [5], which enables
personalization information across different systems to be shared. In digital libraries,
cross system personalization is used to overcome the problem that information
needed to support a personalized user experience is not shared among different
libraries. In other work, the focus has been on adequate representations of [6,7],
and dependencies between [5], the user's profile, to support a unified
representation in the different systems. These approaches are ego-centric in that they
assume the same user to exist across different systems, and that the user is
interested in an aggregated view of their profile or social networking information.
This is not the case in an open social networking environment, where users are
assumed to be similar (in some way), but have distinct digital identities. Our
view is a socio-centric one and focuses on common patterns in the community
as a whole.

Related work also exists in the area of association mining [9,10]. In [9] the goal
is to provide a seamless navigation between tag spaces. The work presented in [10]
merges the areas of formal concept analysis and association rule mining to
discover shared conceptualizations that are hidden in folksonomies. They present
a formalism for folksonomies that includes a set U of users, a set T of tags, and
a set R of resources, related by the ternary relation Y. In our work, it is
the set T of tags in the ternary relation that we propagate from one site to
enrich another. Furthermore, we propose that "citizen-defined" structuring (i.e.,
blogroll, friends, or comment networks) allows other types of ternary relations to
be inferred, that are not restricted to the user-tag-resource triple.

3 Open Social Networking: the Music Domain 

Open social networking has a two-fold goal: 1) improve the structure of
information within a single site, and 2) exploit the social activities in different
sites to enhance the activities in comparable ones. Toward this end, two aspects
are considered: a representation for the activities within each social site; and
mapping parts of the data and structures from one social site to another, to
augment the social activities therein.

3.1 Terminology 

We adopt the definition of a folksonomy as described by Hotho et al. [10,11] as
a four-tuple⁶ F := (U, T, R, Y), where:

• U, T and R are finite sets, whose elements are called users, tags and resources,
respectively, and

• Y is a ternary relation between them, i.e., $Y \subseteq U \times T \times R$, whose
elements are called tag assignments.

For the music domain, the set of resources are considered to be artists, tracks 
and albums. Additionally, we distinguish the different roles a site may have when 
describing the mappings between them. A target site, or in-domain site, is the 
one onto which data from another social site is mapped. The out-of-domain site
is the social site from which data is extracted to augment the comparable, target 
site. The roles of in- and out-of-domain may be interchanged depending upon 
integration goals, and there may be multiple out-of-domain sites. For the purpose 
of this work, we consider Blogger.com and Last.fm to be the in-domain site 
and out-of-domain site, respectively. 

6 In the original definition [10,11], a subtag/supertag relation is additionally
introduced, which we omit for the purpose of this paper.

3.2 Cross-Site Enrichment

We represent the Blogger.com music community conceptually as tuples,
$B := \{(u_b, r_b) \mid (u_b, r_b) \in U_{\mathrm{Blogger.com}} \times R_{\mathrm{Blogger.com}}\}$,
since Blogger.com bloggers do not tag the entities about which they write. On the
other hand, the Last.fm social site is ripe with tag data of the form
$L := \{(u_l, t_l, r_l) \mid (u_l, t_l, r_l) \in U_{\mathrm{Last.fm}} \times T_{\mathrm{Last.fm}} \times R_{\mathrm{Last.fm}}\}$,
representing the tag a given user has applied to a track within Last.fm. Then the
mapping of tags onto Blogger.com is computed as follows:

$$Y := \{(u, t, r) \mid (u, t, r) \in \pi_{u_b, t_l, r_b}(\sigma_{r_b = r_l}(B \times L))\} \qquad (1)$$

where $\sigma$ and $\pi$ are the relational algebra operators for selection and
projection, respectively.

First, from the Cartesian product $B \times L$, the tuples with equal resources in
both sites are selected, and then the projection is taken over the Blogger.com
users, the Last.fm tags, and the common resource elements. In general, the user
sets are considered to be disjoint, i.e., $U_{\mathrm{Blogger.com}} \cap U_{\mathrm{Last.fm}} = \emptyset$.
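
As a minimal sketch of Eq. (1), assuming the tuples fit in memory, the selection on equal resources followed by the projection can be written as a hash join. The function name `map_tags` and the toy data are hypothetical and only serve as illustration; they are not taken from the original experiments.

```python
from collections import defaultdict

def map_tags(B, L):
    """Return the (blogger, tag, resource) triples of Eq. (1).

    B: iterable of (blogger, resource) pairs from the in-domain site.
    L: iterable of (listener, tag, resource) triples from the out-of-domain site.
    """
    # Index the out-of-domain tags by resource, so the Cartesian product
    # B x L never has to be materialised.
    tags_by_resource = defaultdict(set)
    for _listener, tag, resource in L:
        tags_by_resource[resource].add(tag)

    # Selection (equal resources) and projection (blogger, tag, resource).
    return {(blogger, tag, resource)
            for blogger, resource in B
            for tag in tags_by_resource.get(resource, ())}

# Hypothetical toy data, for illustration only.
B = {("blogger_a", "track_1"), ("blogger_b", "track_2")}
L = {("listener_x", "rock", "track_1"),
     ("listener_y", "indie", "track_1"),
     ("listener_z", "pop", "track_3")}
Y = map_tags(B, L)
```

In this toy run only `track_1` occurs in both sites, so `blogger_a` inherits its two tags while `blogger_b` receives none.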

3.3 Site-Specific Enrichment 

The (hidden) relationships between blogs and/or bloggers can be exploited to 
infer relationships between the entities within the blog site. However, even within 
a single blog community, the relationships between resources may not be well
understood. For this reason, we undertake an exploratory analysis of the blogroll 
relationship. 

4 Experiments 

The experimental goals are first to examine the explicit blogroll structure,
laying the foundation for further analysis of ideal or "optimal" resource-specific
blogrolls. Resource-specific blogrolls are those in which the nature of the blogroll
is assumed to be explained in terms of similarity in tastes for a given type of
resource, i.e., track or artist. The second goal is to investigate the extent to
which these optimal resource-specific blogrolls overlap with the explicit blogrolls
and with each other. In the remainder of this discussion, "optimal"
resource-specific blogrolls are referred to as optimal blogrolls.

4.1 Data Set 

For Cross System Music Blog Mining, we used two data sets: one data set
consisted of personal music blogs from Blogger.com, one of the most popular
blogsites, whereas the second data set consisted of tagged tracks from Last.fm,
a radio and music community website and one of the largest social music
platforms. The details of each data set are presented in this section.

Blogger.com Community: The blogroll relationship induces a network
representing a preferential reading of other people's blogs. The network data was
collected by experimentally selecting seed bloggers using several music
directories⁷ and limiting the bloggers selected to the genre of pop and rock
music in the Blogger.com domain. The blogroll for each seed was traversed,
fanning out in a breadth-first order, yielding a total number of bloggers equal to
$|U_{\mathrm{Blogger.com}}| = 976$.
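
The breadth-first fan-out over blogrolls can be sketched as follows. This is only an illustration under stated assumptions, not the crawler used for the data set: `get_blogroll` is a hypothetical callable that returns the blogroll of a blogger, and the toy link structure is invented.

```python
from collections import deque

def crawl_blogrolls(seeds, get_blogroll):
    """Breadth-first traversal of blogroll links starting from seed bloggers.

    get_blogroll is a hypothetical callable mapping a blogger id to the
    list of blogger ids on that blogger's blogroll.
    Returns the bloggers in visiting order and the directed blogroll edges.
    """
    visited = list(dict.fromkeys(seeds))  # de-duplicated, order preserved
    seen = set(visited)
    queue = deque(visited)
    edges = []
    while queue:
        blogger = queue.popleft()
        for neighbour in get_blogroll(blogger):
            edges.append((blogger, neighbour))
            if neighbour not in seen:
                seen.add(neighbour)
                visited.append(neighbour)
                queue.append(neighbour)
    return visited, edges

# Hypothetical toy link structure standing in for real blogrolls.
toy_links = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
visited, edges = crawl_blogrolls(["a"], lambda b: toy_links.get(b, []))
```

The returned edge list is the directed blogroll network on which component statistics can then be computed.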

Summary statistics for the overall structure and topological statistics for the
largest five weak components are given in Table 1. From there it can be seen that
the components exhibit varying structural properties and that the structural
view provided by the blogroll is a disjointed one.

In addition to the community data, profiles were built by parsing the tracks
in the user's blog and relying upon a dictionary of tracks gathered from
MusicBrainz.org⁸. A total of 2196 unique tracks were collected; and for these
tracks, a total of 147801 Last.fm tags were obtained, which allowed us to
construct the triples.



Table 1: Blogger.com Community Statistics and Components 

     662   1755   0.492   2.797   8.134   568
2     53     62   0.574   2.113   5.811    13
3     28     27   0.964   0.964   1.892
4     19     20   0.526   2.0     4.684
      16     15   0.751   1.687   3.187


4.2 Experimental Results 

Blogroll Quality Based on Tracks and Tags: We investigate the extent
to which the explicit blogroll relationship, within the in-domain site, can be
described by the similarity between track and tag profiles. For each user
$u \in U$, i.e., blogger, a track-based vector profile (track profile) $\mathbf{u}$
is constructed such that $\mathbf{u} \in \{0, 1\}^{|R|}$, with the ith dimension
$u_i$ set to 1 if the track $r_i \in R_u$ appears in the user's blog, and 0 otherwise.

Alternatively, after including the tag information from the out-of-domain site,
we constructed a profile based on tag annotations, which corresponds to the tag
profile for the user, i.e., $\mathbf{u} \in \{0, 1\}^{|T|}$, with the ith dimension
set to 1 if the tag $t_i \in T_u$, and 0 otherwise⁹.

Table 2: Average similarity and overlap of the blogroll B and optimal blogroll B*

Profile      AvgSim(B)  AvgSim(B*)  Improvement (%)  Avg(|B ∩ B*|)
Track-Based  0.295      0.547       85%              1.48 ≈ 1 blogger, with probability = 0.085
Tag-Based    0.293      0.645       120%             1.37 ≈ 1 blogger, with probability = 0.081

7 http://www.musicblogscatalog.com/
  http://yocheckthisjam.com/music-blog-directory/
  http://www.blogged.com/directory/entertainment/music/rock
  http://www.blogcatalog.com/directory/music/rock

8 http://www.musicbrainz.org

To evaluate the quality of the explicit blogroll $B_u$ of user u, we computed an
average similarity score between the user and all the bloggers in their blogroll.
We perform the similarity computation using the cosine-based measure:

$$\mathrm{sim}(u, v) := \cos(\mathbf{u}, \mathbf{v}) := \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|} \qquad (2)$$

where $\mathbf{u}$ and $\mathbf{v}$ denote either the track or tag profiles of users
u and v, respectively.
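
The binary profiles and the cosine-based blogroll score can be illustrated with a short sketch. The helper names (`binary_profile`, `cosine`, `avg_blogroll_similarity`), the handling of empty profiles and blogrolls, and the toy data are assumptions made for this example and are not specified in the text.

```python
import math

def binary_profile(items, vocabulary):
    """0/1 vector with a 1 at position i iff vocabulary[i] is in items."""
    present = set(items)
    return [1 if entry in present else 0 for entry in vocabulary]

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty profile is similar to nothing
    return dot / (norm_u * norm_v)

def avg_blogroll_similarity(user, blogroll, profiles):
    """Mean cosine similarity between `user` and the members of `blogroll`."""
    if not blogroll:
        return 0.0  # assumption: an empty blogroll scores 0
    total = sum(cosine(profiles[user], profiles[other]) for other in blogroll)
    return total / len(blogroll)

# Hypothetical toy data: four tracks and three bloggers.
tracks = ["t1", "t2", "t3", "t4"]
profiles = {
    "a": binary_profile({"t1", "t2"}, tracks),
    "b": binary_profile({"t1", "t2"}, tracks),
    "c": binary_profile({"t3", "t4"}, tracks),
}
score = avg_blogroll_similarity("a", ["b", "c"], profiles)
```

Here blogger `a` shares all tracks with `b` and none with `c`, so the blogroll of `a` averages out to roughly 0.5. The same functions apply unchanged to tag profiles by passing a tag vocabulary instead of a track vocabulary.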

For the optimal blogroll B* computation, we constructed a user-track (resp.,
user-tag) profile matrix and proceeded as follows:

(i) Compute the user similarity matrix $S_{|U| \times |U|} := (\mathrm{sim}(u, v))$.

(ii) Keep the highest k entries in each column of S.

(iii) Set the optimal blogroll based on track profiles (resp., tag profiles) for user
u, i.e., $B^*_R$ (resp., $B^*_T$), to be the users in the non-zero columns of the
corresponding u's row.

In our experiments we set the value of the parameter k = 10. Table 2 summarizes
the results.
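
Steps (i) to (iii) can be approximated by a per-user top-k selection, as in the sketch below. It is a simplification under explicit assumptions: since the similarity matrix is symmetric, the k best entries are taken directly per user; a user is excluded from their own blogroll; ties are broken alphabetically; and zero-similarity users are dropped. None of these details is stated in the text, and the function names and toy profiles are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def optimal_blogrolls(profiles, k=10):
    """Map each user to their k most similar other users (non-zero similarity only)."""
    users = sorted(profiles)
    blogrolls = {}
    for u in users:
        # One row of the similarity matrix S, without the diagonal entry.
        scored = [(cosine(profiles[u], profiles[v]), v) for v in users if v != u]
        scored = [(s, v) for s, v in scored if s > 0]
        # Highest similarity first; alphabetical order breaks ties (assumption).
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        blogrolls[u] = [v for _, v in scored[:k]]
    return blogrolls

# Hypothetical toy profiles over three tracks.
profiles = {"a": [1, 1, 0], "b": [1, 1, 0], "c": [1, 0, 0], "d": [0, 0, 1]}
result = optimal_blogrolls(profiles, k=2)
```

With k = 2, blogger `a` is matched with `b` and `c`, while `d`, who shares no track with anyone, ends up with an empty optimal blogroll.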



$B^*_R$ and $B^*_T$ Overlap: This measures the extent to which the optimal blogrolls,
computed with track and tag profiles, agree on their members. We found that 77.66%
of the time they agree on at least one member, and the average overlap in
these cases is $\mathrm{Avg}(|B^*_R \cap B^*_T|) = 4.64 \approx 5$ bloggers out of 10
(the fixed size of the optimal blogrolls). The distribution of the intersection
size for the optimal blogrolls is presented in Fig. 4.
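
The two overlap figures, the share of users whose optimal blogrolls agree on at least one member and the average intersection size in those cases, can be computed as in the following sketch. The function name `overlap_stats` and the toy blogrolls are hypothetical; the code only illustrates the measure as described here.

```python
def overlap_stats(blogrolls_r, blogrolls_t):
    """Return (share of users with a non-empty overlap, mean overlap size in those cases).

    blogrolls_r / blogrolls_t map each user to their track-based and
    tag-based optimal blogroll, respectively.
    """
    sizes = [len(set(members) & set(blogrolls_t.get(user, ())))
             for user, members in blogrolls_r.items()]
    nonzero = [size for size in sizes if size > 0]
    share = len(nonzero) / len(sizes) if sizes else 0.0
    average = sum(nonzero) / len(nonzero) if nonzero else 0.0
    return share, average

# Hypothetical toy blogrolls, for illustration only.
track_based = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"]}
tag_based = {"a": ["b", "c"], "b": ["c"], "c": ["d", "a"]}
share, average = overlap_stats(track_based, tag_based)
```

In the toy data two of the three users have overlapping blogrolls, with intersection sizes 2 and 1, giving a share of about 0.67 and an average overlap of 1.5.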

9 The 20000 most popular tags were used to build the profiles, i.e., |T| = 20000.

Fig. 2: Cumulative Frequency for the explicit blogroll B and the optimal B*
computed using (a) Track Profiles and (b) Tag Profiles.



4.3 Discussion 

From Table 2, it can be observed that the improvement, in terms of similarity,
when computing B* based on tracks is 85%, and 120% when tag profiles are
used. The table also shows that the overlap between the explicit and optimal
blogrolls, computed either with track or tag profiles, occurs only about 8% to 9%
of the time (probability 0.085 for track profiles and 0.081 for tag profiles),
corresponding to an average of a single blogger in those cases.
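
As a quick check of the reported figures, and assuming that Improvement denotes the relative increase of AvgSim(B*) over AvgSim(B) (the formula is not spelled out in the text), the values in Table 2 give:

```latex
\begin{align*}
\text{Track-based:}\quad & \frac{0.547 - 0.295}{0.295} = \frac{0.252}{0.295} \approx 0.854 \approx 85\% \\
\text{Tag-based:}\quad & \frac{0.645 - 0.293}{0.293} = \frac{0.352}{0.293} \approx 1.201 \approx 120\%
\end{align*}
```

Under that assumption both percentages are consistent with the average similarities listed in the table.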

Furthermore, the optimal blogroll similarity distributions are better than the
one produced by the explicit relationships, as shown in Fig. 3, which corresponds
to the absolute frequencies of the explicit and optimal blogrolls, for different
values of similarity. The frequencies of optimal blogrolls, for similarity values
over 0.5, are higher than the ones for explicit blogrolls.

The cumulative frequency of blogrolls over discrete similarity bins is
presented in Fig. 2, which shows that both the track-based (49.27%) and tag-based
(64.32%) optimal blogrolls exhibit good similarity quality, i.e., over 0.50, in
contrast to the respective explicit blogrolls, where just less than 11% of them in
the case of track-based profiles (resp., 16.44% for tag-based) fall in bins
corresponding to similarity values over 0.5.

Fig. 3: Frequency Distribution for the explicit blogroll B and the optimal B*
computed using (a) Track Profiles and (b) Tag Profiles.



Tag-based computations perform better than track-based ones, i.e., they build
optimal blogrolls with higher values of similarity. This can be explained by the
fact that tags capture some structure of the domain, e.g., genre, improving the
overlap between user profiles when computing the similarity measure.



5 Conclusions 



In this paper, we explored to what extent the knowledge and structures from
one social site (out-of-domain) can be adequately exploited to provide new
information and resources to the users in a comparable social site (in-domain), in
particular for a music blog community within the Blogger.com social network.
An examination of the explicit blogroll structure, which is assumed to express
a preferential reading of other people's blogs, has revealed that bloggers tend
to produce sub-optimal blogrolls when measuring the similarity between users
based on tracks as well as tags. The implication of this is that if users are
interested in learning about tracks, tags or other bloggers, some assistance to guide
them is needed. On the other hand, neither tracks nor tags come close to fully
"explaining" the nature of the blogroll.




However, by integrating the social activities of music bloggers and listeners, we
were able to overcome this limitation. We have shown the improvement that
Open Social Networking can have on the quality of blogrolls: tags from Last.fm
yield better optimal blogrolls than the tracks from Blogger.com alone, improving
the quality of the blogroll neighborhoods, in terms of similarity, by 85% when
using tracks and by 120% when integrating tags from the site. The higher
similarity values computed based on tags can be explained by the fact that
tags capture some structure of the domain, e.g., genre, increasing the overlap
probability and size on the user profiles. Though this does not necessarily mean
that tags are a better predictor of similarity than tracks, it strongly suggests that
the kind of information captured by tags can be exploited effectively,
complementing tasks or models where tracks are used alone.

Although our investigation has provided promising results, we believe that our
contribution is an initial step in the study of Open Social Networking; future
work is required to evaluate the usefulness of optimal blogrolls, e.g., in providing
recommendations. Furthermore, we plan to investigate the extent to which the
explicit community bonds and "citizen-defined" structuring (i.e., blogroll, friends,
or comment networks) can be described by mining and inferring associations
between the profiles of users across social sites, towards a more general model that
considers new dimensions beyond the ternary relation between users, tags and
resources.